and Factors in Large Sample Correlation Matrices
Through simple analytical calculations and numerical simulations, we demonstrate the generic
existence of a self-organized macroscopic state in any large multivariate system possessing
non-vanishing average correlations between a finite fraction of all pairs of elements. The coexistence of
an eigenvalue spectrum predicted by random matrix theory (RMT) and a few very large eigenvalues 
in large empirical correlation matrices is shown to result from a bottom-up collective effect of the 
underlying time series rather than a top-down impact of factors. Our results, in excellent agreement 
with previous results obtained on large financial correlation matrices, show that there is relevant 
information also in the bulk of the eigenvalue spectrum and rationalize the presence of market factors 
previously introduced in an ad hoc manner. 



Since Wigner's seminal idea to apply random matrix theory (RMT) to interpret the complex
spectrum of energy levels in nuclear physics [1], RMT has made enormous progress [?] with
many applications in the physical sciences and elsewhere, such as in meteorology [?] and
image processing [?]. A new application was proposed a few years ago to the problem of
correlations between financial assets and to the portfolio optimization problem. It was
shown that, among the eigenvalues and principal components of the empirical correlation
matrix of the returns of hundreds of assets on the New York Stock Exchange (NYSE), apart
from the few highest eigenvalues, the marginal distribution of the other eigenvalues and
eigenvectors closely resembles the spectral distribution of a positive symmetric random
matrix with maximum entropy, suggesting that the correlation matrix does not contain any
specific information beyond these few largest eigenvalues and eigenvectors [?]. These
results apparently invalidate the standard mean-variance portfolio optimization theory [?]
consecrated by the financial industry and seemingly support the rationale behind factor
models such as the capital asset pricing model (CAPM) [?] and the arbitrage pricing theory
(APT) [9], where the correlations between a large number of assets are represented through
a small number of so-called market factors. Indeed, if the spectrum of eigenvalues of the
empirical covariance or correlation matrices is predicted by RMT, it seems natural to
conclude that there is no usable information in these matrices and that empirical
covariance matrices should not be used for portfolio optimization. In contrast, if one
detects deviations between the universal (and therefore non-informative) part of the
spectral properties of empirically estimated covariance and correlation matrices and those
of the relevant ensemble of random matrices, this may quantify the amount of real
information that can be used in portfolio optimization, as distinct from the "noise" that
should be discarded.

More generally, in many different scientific fields, one needs to determine the nature and
amount of information contained in large covariance and correlation matrices. This occurs
as soon as one attempts to estimate very large covariance and correlation matrices in
multivariate dynamics of systems exhibiting non-Gaussian fluctuations with fat tails and/or
long-range time correlations with intermittency. In such cases, the convergence of the
estimators of the large covariance and correlation matrices is often too slow for all
practical purposes. The problem becomes even more complex with time-varying variances and
covariances, as occurs in systems with heteroskedasticity or with regime-switching [10, 11].
A prominent example where such difficulties arise is the data-assimilation problem in
engineering and in meteorology, where forecasting is combined with observations iteratively
through the Kalman filter, based on the estimation and forward prediction of large
covariance matrices [12].

As we said in the context of financial time series, the rescuing strategy is to invoke the
existence of a few dominant factors, such as an overall market factor and the factors
related to firm size, firm industry and book-to-market equity, thought to embody most of
the relevant dependence structure between the studied time series [13]. Indeed, there is no
doubt that observed equity prices respond to a wide variety of unanticipated factors, but
there is much weaker evidence that expected returns are higher for equities that are more
sensitive to these factors, as required by Markowitz's mean-variance theory, by the CAPM
and the APT [14]. This severe failure of the most fundamental finance theories could
conceivably be attributable to an inappropriate proxy for the market portfolio, but nobody
has been able to show that this is really the correct explanation. This remark constitutes
the crux of the problem: the factors invoked to model the cross-sectional dependence
between assets are not known in general and are either postulated based on economic
intuition in financial studies or obtained as black box results in the recent analyses
using RMT [?].

Here, we show that the existence of factors results from a collective effect of the assets,
similar to the emergence of a macroscopic self-organization of interacting microscopic
constituents. For this, we unravel the general physical origin of the large eigenvalues of
large covariance and correlation matrices and provide a complete understanding of the
coexistence of features resembling properties of random matrices and of large "anomalous"
eigenvalues. Through simple analytical calculations and numerical simulations, we
demonstrate the generic existence of a self-organized macroscopic state in any large system
possessing non-vanishing average correlations between a finite fraction of all pairs of
elements.

Let us first consider a large system of size $N$ with correlation matrix $C$ in which every
non-diagonal pair of elements exhibits the same correlation coefficient $C_{ij} = \rho$ for
$i \neq j$ and $C_{ii} = 1$. Its eigenvalues are

$$\lambda_1 = 1 + (N-1)\rho \quad \text{and} \quad \lambda_{i \geq 2} = 1 - \rho \tag{1}$$

with multiplicity $N-1$ and with $\rho \in (0, 1)$ in order for the correlation matrix to
remain positive definite. Thus, in the thermodynamic limit $N \to \infty$, even for a weak
positive correlation $\rho \to 0$ (with $\rho N \gg 1$), a very large eigenvalue appears,
associated with the delocalized eigenvector $v_1 = (1/\sqrt{N})(1, 1, \cdots, 1)$, which
completely dominates the correlation structure of the system. This trivial example stresses
that the key point for the emergence of a large eigenvalue is not the strength of the
correlations, provided that they do not vanish, but the large size $N$ of the system.
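As a quick numerical check of the spectrum (1), the following minimal sketch builds the equicorrelation matrix and diagonalizes it with NumPy. The function name `equicorrelation_matrix` and the values $N = 406$, $\rho = 0.14$ (borrowed from the figures) are illustrative choices, not part of the derivation.

```python
import numpy as np

def equicorrelation_matrix(n, rho):
    """Correlation matrix with 1 on the diagonal and rho everywhere else."""
    return (1.0 - rho) * np.eye(n) + rho * np.ones((n, n))

# Illustrative values (assumption): any n and any rho in (0, 1) would do
n, rho = 406, 0.14
eigvals, eigvecs = np.linalg.eigh(equicorrelation_matrix(n, rho))  # ascending order

lam_max = eigvals[-1]
v1 = eigvecs[:, -1]
# Overlap of the top eigenvector with the uniform vector (1, ..., 1)/sqrt(n)
overlap = abs(v1 @ np.ones(n)) / np.sqrt(n)

print(lam_max, 1 + (n - 1) * rho)          # both approximately 57.7
print(eigvals[0], eigvals[-2], 1 - rho)    # the n-1 remaining eigenvalues equal 1 - rho
print(overlap)                             # close to 1: delocalized eigenvector
```

Decreasing `rho` while increasing `n` at fixed `rho * n` leaves the largest eigenvalue essentially unchanged, which is the point made above about system size.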

This result (1) still holds qualitatively when the correlation coefficients are all
distinct. To see this, it is convenient to use a perturbation approach. We thus add a small
random component to each correlation coefficient:

$$C_{ij} = \rho + \epsilon \cdot a_{ij} \quad \text{for } i \neq j \tag{2}$$

where the coefficients $a_{ij} = a_{ji}$ have zero mean, variance $\sigma^2$ and are
independently distributed (there are additional constraints on the support of the
distribution of the $a_{ij}$'s in order for the matrix $C_{ij}$ to remain positive definite
with probability one). The determination of the eigenvalues and eigenvectors of $C_{ij}$ is
performed using the perturbation theory developed in quantum mechanics [15] up to the
second order in $\epsilon$. We find that the largest eigenvalue becomes

$$\mathrm{E}[\lambda_1] = (N-1)\rho + 1 + \frac{(N-1)(N-2)}{N^2} \cdot \frac{\epsilon^2 \sigma^2}{\rho} + O(\epsilon^3) \tag{3}$$

while, at the same order, the corresponding eigenvector $v_1$ remains unchanged. The
degeneracy of the eigenvalue $\lambda = 1 - \rho$ is broken and leads to a complex set of
smaller eigenvalues described below.

FIG. 1: Spectrum of eigenvalues of a random correlation matrix with average correlation
coefficient $\rho = 0.14$ and standard deviation of the correlation coefficients
$\sigma = 0.345/\sqrt{N}$. The size $N = 406$ of the matrix is the same as in previous
studies [?] for the sake of comparison. The continuous curve is the theoretical translated
semi-circle distribution of eigenvalues describing the bulk of the distribution, which
passes the Kolmogorov test. The center value $\lambda = 1 - \rho$ ensures the conservation
of the trace equal to $N$. There is no adjustable parameter. The inset represents the whole
spectrum with the largest eigenvalue, whose size is in agreement with the prediction
$\rho N = 56.8$.
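The second-order result for the largest eigenvalue can be probed by a small Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: Gaussian $a_{ij}$, and the hypothetical values $N = 50$, $\rho = 0.2$, $\epsilon = 1$, $\sigma = 0.05$, chosen so that the second-order shift is larger than the Monte Carlo error. The helper names are not from the original text.

```python
import numpy as np

def perturbed_correlation(n, rho, eps, sigma, rng):
    """C_ij = rho + eps * a_ij off the diagonal and 1 on it, with symmetric Gaussian a_ij."""
    a = np.triu(rng.normal(0.0, sigma, size=(n, n)), k=1)
    c = rho + eps * (a + a.T)
    np.fill_diagonal(c, 1.0)
    return c

def second_order_prediction(n, rho, eps, sigma):
    """Expected largest eigenvalue to second order in eps (remainder O(eps^3) dropped)."""
    return (n - 1) * rho + 1 + (n - 1) * (n - 2) / n**2 * (eps * sigma) ** 2 / rho

# Illustrative parameters (assumptions, not values from the text)
n, rho, eps, sigma, n_draws = 50, 0.2, 1.0, 0.05, 2000
rng = np.random.default_rng(0)

largest = np.array([
    np.linalg.eigvalsh(perturbed_correlation(n, rho, eps, sigma, rng))[-1]
    for _ in range(n_draws)
])
mc_mean = largest.mean()
mc_stderr = largest.std(ddof=1) / np.sqrt(n_draws)
zeroth_order = (n - 1) * rho + 1
predicted = second_order_prediction(n, rho, eps, sigma)

print(f"Monte Carlo mean : {mc_mean:.4f} +/- {mc_stderr:.4f}")
print(f"zeroth order     : {zeroth_order:.4f}")
print(f"second order     : {predicted:.4f}")
```

With these settings the average largest eigenvalue should sit slightly above the unperturbed value, by an amount of order $\epsilon^2\sigma^2/\rho$.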

In fact, this result (3) can be generalized to the non-perturbative domain of any
correlation matrix with independent random coefficients $C_{ij}$, provided that they have
the same mean value $\rho$ and variance $\sigma^2$. Indeed, it has been shown [16] that, in
such a case, the expectations of the largest and second largest eigenvalues are

$$\mathrm{E}[\lambda_1] = (N-1) \cdot \rho + 1 + \sigma^2/\rho + o(1), \tag{4}$$

$$\mathrm{E}[\lambda_2] \leq 2\sigma\sqrt{N} + O(N^{1/3} \log N). \tag{5}$$

Moreover, the statistical fluctuations of these two largest eigenvalues are asymptotically
(for large fluctuations $t > O(\sqrt{N})$) bounded by a Gaussian distribution according to
the following large deviation theorem

$$\Pr\{|\lambda_{1,2} - \mathrm{E}[\lambda_{1,2}]| > t\} \leq e^{-c_{1,2} t^2} \tag{6}$$

for some positive constant $c_{1,2}$ [17].
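A single realization already illustrates the separation between the two largest eigenvalues. The sketch below assumes Gaussian off-diagonal entries and reuses the parameters quoted in the caption of FIG. 1. The comparison value for the second eigenvalue, $1 - \rho + 2\sigma\sqrt{N}$, is an assumption about how the bound on $\lambda_2$ translates once the bulk is centered at $1 - \rho$, not a statement taken from [16].

```python
import numpy as np

def random_correlation(n, rho, sigma, rng):
    """Symmetric matrix with unit diagonal and independent off-diagonal entries
    of mean rho and standard deviation sigma (Gaussian entries are an assumption)."""
    noise = np.triu(rng.normal(0.0, sigma, size=(n, n)), k=1)
    c = rho + noise + noise.T
    np.fill_diagonal(c, 1.0)
    return c

n, rho = 406, 0.14
sigma = 0.345 / np.sqrt(n)
rng = np.random.default_rng(1)

eigvals = np.linalg.eigvalsh(random_correlation(n, rho, sigma, rng))
lam1, lam2 = eigvals[-1], eigvals[-2]

lam1_pred = (n - 1) * rho + 1 + sigma**2 / rho   # leading terms of the largest eigenvalue
bulk_edge = 1 - rho + 2 * sigma * np.sqrt(n)     # bulk center plus semi-circle radius

print(f"largest eigenvalue : {lam1:.3f} (prediction {lam1_pred:.3f})")
print(f"second eigenvalue  : {lam2:.3f} (bulk edge  {bulk_edge:.3f})")
```

The largest eigenvalue scales with $N$ while the second one stays of order $\sigma\sqrt{N}$, so the gap widens as the system grows.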

This result is very different from that obtained when the mean value $\rho$ vanishes. In
such a case, the distribution of eigenvalues of the random matrix $C$ is given by the
semi-circle law [?]. However, due to the presence of the ones on the main diagonal of the
correlation matrix $C$, the center of the circle is not at the origin but at the point
$\lambda = 1$. Thus, the distribution of the eigenvalues of random correlation matrices
with zero mean correlation coefficients is a semi-circle of radius $2\sigma\sqrt{N}$
centered at $\lambda = 1$.
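The zero-mean case can be checked in the same way. This sketch assumes Gaussian entries and an illustrative $\sigma = 0.345/\sqrt{N}$. It verifies that the spectrum is centered at 1, stays within the radius $2\sigma\sqrt{N}$, and carries the fraction of eigenvalues that a semi-circle density predicts within half a radius of the center. The helper `semicircle_mass` is a hypothetical name.

```python
import numpy as np

def semicircle_mass(half_width, radius):
    """Mass of the semi-circle density within half_width of its center."""
    x = half_width / radius
    return (2 / np.pi) * (x * np.sqrt(1 - x**2) + np.arcsin(x))

n = 406
sigma = 0.345 / np.sqrt(n)             # illustrative scaling (assumption)
radius = 2 * sigma * np.sqrt(n)
rng = np.random.default_rng(2)

# Zero-mean symmetric off-diagonal entries, ones on the diagonal
noise = np.triu(rng.normal(0.0, sigma, size=(n, n)), k=1)
c = noise + noise.T + np.eye(n)
eigvals = np.linalg.eigvalsh(c)

inner_fraction = np.mean(np.abs(eigvals - 1) <= radius / 2)

print(f"mean eigenvalue       : {eigvals.mean():.6f}")
print(f"range                 : [{eigvals.min():.3f}, {eigvals.max():.3f}]")
print(f"semi-circle support   : [{1 - radius:.3f}, {1 + radius:.3f}]")
print(f"fraction in inner half: {inner_fraction:.3f} "
      f"(semi-circle: {semicircle_mass(radius / 2, radius):.3f})")
```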

The result (4) is deeply related to the so-called "friendship theorem" in mathematical
graph theory, which states that, in any finite graph such that any two vertices have
exactly one common neighbor, there is one and only one vertex adjacent to all other
vertices [18]. A more heuristic but equivalent statement is that, in a group of people such
that any pair of persons have exactly one common friend, there is always one person (the
"politician") who is the friend of everybody. The connection is established by taking the
non-diagonal entries $C_{ij}$ ($i \neq j$) equal to Bernoulli random variables with
parameter $p$, that is, $\Pr[C_{ij} = 1] = p$ and $\Pr[C_{ij} = 0] = 1 - p$. Then, the
matrix $C - I$, where $I$ is the unit matrix, becomes nothing but the adjacency matrix of
the random graph $G(N, p)$ [19]. The proof [?] of the "friendship theorem" indeed relies on
the $N$-dependence of the largest eigenvalue and on the $\sqrt{N}$-dependence of the second
largest eigenvalue of $C_{ij}$ as given by (4) and (5).
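The random-graph reading can be illustrated directly on an adjacency matrix. In this sketch the values $N = 500$ and $p = 0.1$ are arbitrary assumptions. The prediction for the largest eigenvalue is expression (4) with the unit diagonal removed and $\sigma^2 = p(1-p)$, the variance of a Bernoulli variable.

```python
import numpy as np

n, p = 500, 0.1                      # illustrative graph size and edge probability
rng = np.random.default_rng(3)

# Adjacency matrix of G(n, p): symmetric, zero diagonal, Bernoulli(p) entries
upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)
adjacency = upper + upper.T

eigvals = np.linalg.eigvalsh(adjacency)
lam1, lam2 = eigvals[-1], eigvals[-2]

sigma = np.sqrt(p * (1 - p))
lam1_pred = (n - 1) * p + sigma**2 / p    # order-N largest eigenvalue
lam2_scale = 2 * sigma * np.sqrt(n)       # order-sqrt(N) second eigenvalue

print(f"largest eigenvalue: {lam1:.2f} (prediction {lam1_pred:.2f})")
print(f"second eigenvalue : {lam2:.2f} (scale      {lam2_scale:.2f})")
```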

Figure 1 shows the distribution of eigenvalues of a random correlation matrix. The inset
shows the largest eigenvalue lying at the predicted size $\rho N = 56.8$, while the bulk of
the eigenvalues are much smaller and are described by a modified semi-circle law centered
on $\lambda = 1 - \rho$, in the limit of large $N$. The result on the largest eigenvalue
emerging from the collective effect of the cross-correlation between all $N(N-1)/2$ pairs
provides a novel perspective on the observation [?] that the only reasonable explanation
for the simultaneous crash of 23 stock markets worldwide in October 1987 is the impact of a
world market factor: according to our demonstration, the simultaneous occurrence of
significant correlations between the markets worldwide is bound to lead to the existence of
an extremely large eigenvalue, the world market factor constructed by ... a linear
combination of the 23 stock markets! What our result shows is that invoking factors to
explain the cross-sectional structure of stock returns is cursed by the chicken-and-egg
problem: factors exist because stocks are correlated; stocks are correlated because of
common factors impacting them.

FIG. 2: Spectrum estimated from the sample correlation matrix obtained from $N = 406$ time
series of length $T = 1309$ (the same length as in [?]) with the same theoretical
correlation matrix as that presented in figure 1.

Figure 2 shows the eigenvalue distribution of the sample correlation matrix reconstructed
by sampling $N = 406$ time series of length $T = 1309$ generated with a given correlation
matrix $C$ with theoretical spectrum shown in figure 1. The largest eigenvalue is again
very close to the prediction $\rho N = 56.8$ while the bulk of the distribution departs
very strongly from the semi-circle law and is not far from the Wishart prediction, as
expected from the definition of the Wishart ensemble as the ensemble of sample covariance
matrices of Gaussian distributed time series with unit variance and zero mean. A Kolmogorov
test shows however that the bulk of the spectrum is not in the Wishart class, in
contradiction with previous claims lacking formal statistical tests [?]. This result holds
for different simulations of the sample correlation matrix and different realizations of
the theoretical correlation matrix with the same parameters $(\rho, \sigma)$. The
statistically significant departure from the Wishart prediction implies that there is
actually some information in the bulk of the spectrum of eigenvalues, which is intimately
coupled with the existence of the largest eigenvalue. We have also checked that these
results remain robust for non-Gaussian distributions of returns as long as the second
moments exist. Indeed, correlated time series with multivariate Gaussian or Student
distributions with three degrees of freedom (which provide more acceptable proxies for
financial time series [20]) give no discernible differences in the spectrum of eigenvalues.
This is surprising as the estimator of a correlation coefficient is asymptotically Gaussian
for time series with finite fourth moment and Lévy stable otherwise [21].
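The sampling experiment can be sketched as follows, under the assumption of Gaussian time series generated through a Cholesky factor of the theoretical matrix. The support edges $(1 \pm \sqrt{N/T})^2$ are the standard ones for the eigenvalue density of sample correlation matrices of uncorrelated unit-variance series. They serve here only as a descriptive yardstick and do not replace the Kolmogorov test used in the text.

```python
import numpy as np

n, t, rho = 406, 1309, 0.14
sigma = 0.345 / np.sqrt(n)
rng = np.random.default_rng(4)

# Theoretical correlation matrix of the FIG. 1 type (Gaussian entries are an assumption)
noise = np.triu(rng.normal(0.0, sigma, size=(n, n)), k=1)
c_true = rho + noise + noise.T
np.fill_diagonal(c_true, 1.0)

# Correlated Gaussian series: one row per series, one column per time step
chol = np.linalg.cholesky(c_true)   # raises LinAlgError if c_true is not positive definite
series = chol @ rng.standard_normal((n, t))

c_sample = np.corrcoef(series)
sample_eigs = np.linalg.eigvalsh(c_sample)

# Support edges for uncorrelated series, with q = T / N
q = t / n
edge_low = (1 - 1 / np.sqrt(q)) ** 2
edge_high = (1 + 1 / np.sqrt(q)) ** 2
bulk = sample_eigs[:-1]             # everything except the largest eigenvalue
outside = np.mean((bulk < edge_low) | (bulk > edge_high))

print(f"largest sample eigenvalue: {sample_eigs[-1]:.2f} (rho * N = {rho * n:.2f})")
print(f"uncorrelated support     : [{edge_low:.3f}, {edge_high:.3f}]")
print(f"bulk fraction outside it : {outside:.3f}")
```

Replacing `rng.standard_normal` by a heavy-tailed generator with finite variance would be one way to repeat the robustness check mentioned above.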

FIG. 3: Spectrum of eigenvalues estimated from the sample correlation matrix of $N = 406$
time series of length $T = 1309$. The time series have been constructed from a multivariate
Gaussian distribution with a correlation matrix made of three block-diagonal matrices of
sizes respectively equal to 130, 140 and 136 and mean correlation coefficients equal to
0.18 for all of them. The off-diagonal elements are all equal to 0.1. The same results hold
if the off-diagonal elements are random.

Up to now, we have focused on the collective mechanism at the origin of the very large
eigenvalue of order $N$. Empirically [?], a few other eigenvalues have an amplitude of the
order of 5 to 10 and deviate significantly from the bulk of the distribution. These
eigenvalues cannot be obtained by a matrix of the form (2) with independent and identically
distributed coefficients $a_{ij}$. Our analysis provides a very simple constructive
mechanism for them. The solution consists in considering, as a first approximation, the
block diagonal matrix $C'$ with diagonal elements made of the matrices $A_1, \cdots, A_p$
of sizes $N_1, \cdots, N_p$ with $\sum_i N_i = N$, constructed according to (2) such that
each matrix $A_i$ has the average correlation coefficient $\rho_i$. When the coefficients
of the matrix $C'$ outside the matrices $A_i$ are zero, the spectrum of $C'$ is given by
the union of all the spectra of the $A_i$'s, which are each dominated by a large eigenvalue
$\lambda_{1,i} \simeq \rho_i \cdot N_i$. The spectrum of $C'$ then exhibits $p$ large
eigenvalues.
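The block-diagonal construction is easy to verify in the idealized case where each block has a constant correlation coefficient (a simplification: the blocks described above have random coefficients with a given mean). The block sizes and the value 0.18 are taken from the caption of FIG. 3, and the function names are hypothetical.

```python
import numpy as np

def equicorrelation_block(size, rho):
    """Block with unit diagonal and constant off-diagonal correlation rho."""
    return (1.0 - rho) * np.eye(size) + rho * np.ones((size, size))

def block_diagonal(blocks):
    """Assemble square blocks along the diagonal, with zeros elsewhere."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    start = 0
    for b in blocks:
        stop = start + b.shape[0]
        out[start:stop, start:stop] = b
        start = stop
    return out

sizes, rhos = (130, 140, 136), (0.18, 0.18, 0.18)
blocks = [equicorrelation_block(s, r) for s, r in zip(sizes, rhos)]
c_prime = block_diagonal(blocks)

spectrum = np.linalg.eigvalsh(c_prime)
union = np.sort(np.concatenate([np.linalg.eigvalsh(b) for b in blocks]))

top = spectrum[-len(sizes):][::-1]   # the p largest eigenvalues, in descending order
expected_top = sorted((1 + (s - 1) * r for s, r in zip(sizes, rhos)), reverse=True)

print(top)            # one large eigenvalue per block
print(expected_top)   # 1 + (N_i - 1) * rho_i, close to rho_i * N_i
```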

Each block $A_i$ can be interpreted as a sector of the economy, including all the companies
belonging to the same industrial branch, and the eigenvector associated with each largest
eigenvalue represents the main factor driving this sector of activity [22]. For similar
sector sizes $N_i$ and average correlation coefficients $\rho_i$, the largest eigenvalues
are of the same order of magnitude. In order to recover a very large unique eigenvalue, we
reintroduce some coupling constants outside the block diagonal matrices. A well-known
result of perturbation theory in quantum mechanics states that such coupling leads to a
repulsion between the eigenstates, which can be observed in figure 3, where $C'$ has been
constructed with three block matrices $A_1$, $A_2$ and $A_3$ and non-zero off-diagonal
coupling described in the figure caption. These values allow us to quantitatively replicate
the empirical finding of Laloux et al. in [?], where the three first eigenvalues are
approximately $\lambda_1 \simeq 57$, $\lambda_2 \simeq 10$ and $\lambda_3 \simeq 8$. The
bulk of the spectrum (which excludes the three largest eigenvalues) is similar to the
Wishart distribution but again statistically different from it as tested with a Kolmogorov
test. There is thus significant deviation from the predictions of RMT not only for the
largest eigenvalues but also in the bulk.
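The repulsion produced by the inter-block coupling can be seen on the theoretical matrix alone, before any sampling noise. The sketch below assumes constant correlations (0.18 inside the blocks, 0.1 between them, as in the caption of FIG. 3). It also uses the fact that block-constant vectors span an invariant subspace, so that the three largest eigenvalues coincide with those of a reduced 3 x 3 matrix. It is an illustration of the mechanism, not a reproduction of the sampled spectrum of the figure.

```python
import numpy as np

sizes, rho_in, rho_out = (130, 140, 136), 0.18, 0.10
labels = np.repeat(np.arange(len(sizes)), sizes)      # block index of each series
same_block = labels[:, None] == labels[None, :]

def block_matrix(coupling):
    """rho_in inside the blocks, `coupling` between blocks, ones on the diagonal."""
    c = np.where(same_block, rho_in, coupling).astype(float)
    np.fill_diagonal(c, 1.0)
    return c

top_uncoupled = np.linalg.eigvalsh(block_matrix(0.0))[-3:][::-1]
top_coupled = np.linalg.eigvalsh(block_matrix(rho_out))[-3:][::-1]

# Reduced problem on normalized block-constant vectors:
# diagonal 1 + (N_i - 1) * rho_in, off-diagonal rho_out * sqrt(N_i * N_j)
s = np.array(sizes, dtype=float)
reduced = rho_out * np.sqrt(np.outer(s, s))
np.fill_diagonal(reduced, 1 + (s - 1) * rho_in)
top_reduced = np.linalg.eigvalsh(reduced)[::-1]

print("uncoupled:", top_uncoupled)
print("coupled  :", top_coupled)
print("reduced  :", top_reduced)
```

The coupling pushes one eigenvalue well above the uncoupled values and the other two below them, which is the level repulsion invoked above.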

As a final remark, the expressions derived above and our numerical tests for a large
variety of correlation matrices show that the delocalized eigenvector
$v_1 = (1/\sqrt{N})(1, 1, \cdots, 1)$, associated with the largest eigenvalue, is extremely
robust and remains (on average) the same for any large system. Thus, even for time-varying
correlation matrices (as in finance with important heteroskedastic effects), the
composition of the main factor remains almost the same. This can be seen as a generalized
limit theorem reflecting the bottom-up organization of broadly correlated time series.
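The robustness of the leading eigenvector can be illustrated by measuring its overlap with the uniform vector for several parameter choices. The pairs $(\rho, \sigma)$ below are arbitrary assumptions, chosen so that the semi-circle radius $2\sigma\sqrt{N}$ stays below $1 - \rho$ and the matrices remain positive definite. Only the second pair comes from the figures.

```python
import numpy as np

def top_eigenvector_overlap(n, rho, sigma, rng):
    """Overlap |<v1, u>| between the leading eigenvector and u = (1, ..., 1)/sqrt(n)."""
    noise = np.triu(rng.normal(0.0, sigma, size=(n, n)), k=1)
    c = rho + noise + noise.T
    np.fill_diagonal(c, 1.0)
    _, vecs = np.linalg.eigh(c)        # eigenvalues in ascending order
    v1 = vecs[:, -1]
    return abs(v1.sum()) / np.sqrt(n)

n = 406
rng = np.random.default_rng(5)
# (rho, sigma) pairs: illustrative values mimicking a changing correlation structure
settings = [(0.05, 0.010), (0.14, 0.345 / np.sqrt(n)), (0.30, 0.015), (0.50, 0.010)]
overlaps = [top_eigenvector_overlap(n, rho, sigma, rng) for rho, sigma in settings]

for (rho, sigma), ov in zip(settings, overlaps):
    print(f"rho = {rho:.2f}, sigma = {sigma:.4f}: overlap = {ov:.6f}")
```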